Audio pattern recognition - Music genre classification - Paolo Cortis

Running experiments with different machine learning models for a multi-class music genre classification task.

Libraries

Data Retrieval and feature extraction

Data Preprocessing and visualization

Samples-features ratio

If the number of features exceeds the number of samples, the model is likely to behave erratically and overfit.

NaN values

NaN values degrade model performance, so they must be removed or imputed before training.
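A minimal pandas sketch of the check (the column names here are hypothetical stand-ins for the extracted audio features):

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with a missing value.
df = pd.DataFrame({"tempo": [120.0, np.nan, 95.0],
                   "rms_mean": [0.1, 0.2, 0.3]})

nan_counts = df.isna().sum()       # count NaNs per column
clean = df.dropna()                # option 1: drop rows containing NaNs
imputed = df.fillna(df.median())   # option 2: impute with the column median
```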

Label encoding

Time Domain

Frequency domain

Spectrogram, mel spectrogram, MFCCs (3s window)
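For the STFT spectrogram, a dependency-light sketch on a synthetic 3 s tone is shown below (the notebook presumably uses librosa, whose `librosa.feature.melspectrogram` and `librosa.feature.mfcc` produce the mel spectrogram and MFCCs; `scipy` is used here only for illustration, and the window parameters are assumptions):

```python
import numpy as np
from scipy import signal

sr = 22050                               # GTZAN sample rate, 22.05 kHz
t = np.linspace(0, 3.0, 3 * sr, endpoint=False)
y = np.sin(2 * np.pi * 440.0 * t)        # synthetic 3 s, 440 Hz tone

# STFT magnitude spectrogram over the 3 s window.
f, times, Sxx = signal.spectrogram(y, fs=sr, nperseg=2048, noverlap=1024)
# Sxx has one row per frequency bin and one column per time frame.
```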

Visualizing the data helps to evaluate the difficulty of the task and to design models with the appropriate complexity.

First, a parallel plot shows the value distribution of the main features with respect to the label, to determine whether these features are meaningful for the classification task and to what extent.

Next, the PCA decomposition technique is used to reduce the high-dimensional dataset to two dimensions, preserving spatial characteristics for visualization purposes.

PCA

A technique that reduces the dimensionality of a dataset by linearly transforming the data into a new coordinate system in which most of the original variation can be described with fewer dimensions than the initial data.
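A minimal scikit-learn sketch of the 2D projection (the feature count of 57 is an assumption, a stand-in for the tabular GTZAN features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 57))        # stand-in for the tabular features

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)             # project onto the two top components
ratio = pca.explained_variance_ratio_ # variance retained by each component
```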

The following is a visualization of the feature space considering the features apart from MFCCs as well as MFCCs only.

The simple clustering algorithm K-means is also applied to evaluate the task difficulty and the feature space.

Boruta Feature Selection

The algorithm gives a numerical estimate of feature importance and categorizes features into those to keep, those to discard, and tentative ones, for which there is uncertainty regarding their actual correlation with the output. Here it is used to evaluate the features and determine whether some are not useful for the task at hand.
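The notebook presumably relies on the `BorutaPy` package; the following is a simplified, single-pass sketch of the core idea only: each real feature's random-forest importance is compared against "shadow" features obtained by shuffling the real ones (the dataset and all parameters are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

# Shadow features: each column independently permuted, destroying any
# relation with the labels while keeping the marginal distributions.
rng = np.random.default_rng(0)
shadows = rng.permuted(X, axis=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(np.hstack([X, shadows]), y)

real_imp = rf.feature_importances_[:10]
shadow_max = rf.feature_importances_[10:].max()
keep = real_imp > shadow_max          # features beating the best shadow
```

The real algorithm repeats this comparison over many iterations with statistical tests, which is where the keep/discard/tentative split comes from.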

Normalization via robust scaler

Data normalization is known to improve model performance.
The robust scaler normalizes by subtracting the median and dividing by the interquartile range (IQR). This amounts to a z-score-like standardization that is robust to outliers.
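A minimal scikit-learn illustration on a toy column containing one outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])    # one clear outlier
scaled = RobustScaler().fit_transform(X)         # (x - median) / IQR
# The outlier barely affects the scaling of the other values, unlike
# a mean/std z-score, where it would inflate the standard deviation.
```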

Feature selection

Feature selection implies the choice of a subset of the original pool of features.
This is a useful step: irrelevant or partially relevant features can negatively impact model performance, so feature selection reduces noise, improves model accuracy, and shortens training time by cutting the total number of features.

Correlation with the output tests

Features not correlated with the output carry no relevant information for the model to exploit for the task at hand, so they can be removed.

In the following section, functions for each of these tests are defined and run over the entire dataset. This is done here for explanatory purposes only; the features flagged by the tests are not dropped from the dataset at this stage, as that would introduce bias through data leakage. Instead, the tests are saved for the training "main loop" and run, for each holdout, on the training portion of the data only, keeping the test data unseen and reserved for actual testing.

Pearson correlation test

Spearman correlation test

Maximal information coefficient
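A sketch of the first two tests with scipy on a synthetic correlated pair. MIC itself comes from the `minepy` package (MINE statistics), which may not be installed everywhere; scikit-learn's mutual information is used below only as a related nonlinear-dependence measure, not as a MIC replacement:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_cont = 2 * x + rng.normal(scale=0.1, size=200)   # strongly correlated

r, p = pearsonr(x, y_cont)          # linear correlation
rho, p_s = spearmanr(x, y_cont)     # monotonic (rank) correlation

# Nonlinear dependence proxy against a discrete label:
y_cls = (y_cont > 0).astype(int)
mi = mutual_info_classif(x.reshape(-1, 1), y_cls, random_state=0)
```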

Drop features uncorrelated with the output

Correlation with the features tests

Features correlated with each other carry redundant information, so one feature of each correlated pair can be removed.

Pearson correlation test

Identifies linear correlation between features.
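A pandas sketch of the feature-feature test: compute the absolute correlation matrix, look at its upper triangle, and drop one feature of each pair above a threshold (the 0.9 threshold and the toy columns are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=300)
df = pd.DataFrame({"a": a,
                   "b": a * 3 + rng.normal(scale=0.01, size=300),  # ~ copy of a
                   "c": rng.normal(size=300)})                     # independent

corr = df.corr().abs()
# Upper triangle only, so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
```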

One hot encoding labels
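A one-line pandas version of the encoding, on hypothetical genre labels:

```python
import pandas as pd

labels = pd.Series(["blues", "jazz", "rock", "jazz"])
onehot = pd.get_dummies(labels)   # one indicator column per genre
```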

Learning

The holdouts used to evaluate the various architectures are generated with the stratified Monte Carlo method, which each time produces a different arrangement of the training, validation and test sets while keeping roughly the same class balance.

10 holdouts are used to train and evaluate each model: 20% of the whole dataset is used for testing and 80% for training, of which 20% is reserved for validating the meta-model.

To avoid excessively increasing the computational cost of the main training loop, the hyperparameter tuning step is performed, for all holdouts, on a single split of the training set into training and validation portions.
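The splitting scheme above can be sketched with scikit-learn's `StratifiedShuffleSplit` (the dataset sizes are hypothetical; only the ratios follow the text):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 57))            # stand-in feature matrix
y = np.repeat(np.arange(10), 100)          # 10 balanced genres

# 10 stratified Monte Carlo holdouts: 80% train / 20% test each time.
outer = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
for train_idx, test_idx in outer.split(X, y):
    # 20% of the training portion reserved for validation.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[train_idx], y[train_idx], test_size=0.2,
        stratify=y[train_idx], random_state=0)
```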

Evaluate a model

Next is a set of functions to evaluate and visualize model performance.

To obtain a statistically sound estimate of an architecture's performance, multiple models with the same architecture are built and trained over different portions of the data (holdouts), and their average performance is taken as an estimate of the architecture's overall performance.
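In code, the aggregation step boils down to a mean and dispersion over the per-holdout scores (the accuracy values below are purely illustrative):

```python
import numpy as np

# Hypothetical per-holdout test accuracies for one architecture.
accuracies = np.array([0.71, 0.74, 0.69, 0.73, 0.72,
                       0.70, 0.75, 0.68, 0.73, 0.71])

mean = accuracies.mean()
std = accuracies.std(ddof=1)   # sample standard deviation across holdouts
summary = f"accuracy: {mean:.3f} +/- {std:.3f} over {len(accuracies)} holdouts"
```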

Hyperparameter tuning

Hyperparameter tuning is performed by training a meta-model on the training and validation data to obtain the best set of hyperparameters via Bayesian optimization.

Here is the function to train the hypermodel.

MLP

A multi-layer perceptron is a feed-forward neural network consisting of an input layer, an output layer, and hidden dense layers in between. Dropout layers were added to mitigate overfitting.

This type of model is applied to classify music genre from the 1D feature vectors.
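The notebook's MLPs are presumably Keras models with dropout; as a compact, dependency-light illustration of the same idea (dense hidden layers on the 1D feature vectors), scikit-learn's `MLPClassifier` can be sketched as follows, with a synthetic stand-in dataset and assumed layer sizes:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 1D feature vectors, 10 genres.
X, y = make_classification(n_samples=500, n_features=57, n_informative=20,
                           n_classes=10, random_state=0)

# Two hidden dense layers; no dropout here (sklearn's MLP regularizes
# with the L2 penalty `alpha` instead).
mlp = MLPClassifier(hidden_layer_sizes=(128, 64), alpha=1e-3,
                    max_iter=300, random_state=0).fit(X, y)
acc = mlp.score(X, y)   # training accuracy, for illustration only
```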

Here are the functions to build the hypermodel for hyperparameter tuning and to build the model with the best set of hyperparameters found.

To evaluate the performance of the optimized models and determine whether Bayesian optimization yields better results, fixed hand-made architectures, deemed adequate for the complexity of the task, are also defined for comparison.

Main training loop - MLP

When using fixed hand-made architectures:

When applying hyperparameter tuning:

Fixed MLP

Tuned MLP

Tuned MLP on other features only (not MFCCs)

There is consensus in the literature that MFCCs are very powerful features for audio pattern recognition, and in particular for this music genre classification task.
To determine to what extent the other features are relevant and useful for the task, the following models are tuned on the 30s and 3s windows excluding the MFCC means and variances.

CNN

A convolutional neural network consists of a set of convolutional blocks (convolution + max pooling) followed by a set of dense layers. Dropout layers were added to mitigate overfitting.

As with the MLP, multiple models with the same architecture are trained over different holdouts, and their average performance is taken as the estimate of the architecture's overall performance.

Hyperparameter tuning is not performed in this case: the specific, complex configurations reported in the literature are hard to reproduce by tuning with the small amount of data in the GTZAN dataset. The fixed architectures still provide good results.

Here are the functions to evaluate the model performances and to build and train the model.

These architectures handle 2D data well, retaining spatial information; hence the data used to train and evaluate them are:

Main training loop - CNN

30s Window Mel Spectrogram (GTZAN) CNN

3s Window STFT Spectrogram CNN

3s Window Mel Spectrogram CNN

3s Window MFCCs CNN

Multi modal neural network

A multi-modal neural network is a model capable of jointly representing and exploiting the information of both data modalities (audio features and MFCCs). To build this network, the layers of an MLP and a CNN are concatenated, each receiving its own input; the outputs are then combined into a final dense layer and an output layer.

The MMNN is here composed of an MLP and a CNN, which are then concatenated and trained as a whole.

As with the MLP and the CNN, multiple models with the same architecture are trained over different holdouts, and their average performance is taken as the estimate of the architecture's overall performance.

Here are the functions to align the data of the two modalities, build the submodels, and build and train the MMNN.

Main training loop - MMNN

Complete results visualization - Barplots

Below, the results from all the different architectures are grouped and visualized via bar plots.

The following cells format the results for the barplot visualization.